15 research outputs found

    Neural Coreference Resolution for Turkish

    Coreference resolution deals with resolving mentions of the same underlying entity in a given text. This challenging task is an indispensable aspect of text understanding and has important applications in various language processing systems such as question answering and machine translation. Although a significant number of studies has been devoted to coreference resolution, research on Turkish is scarce and mostly limited to pronoun resolution. To the best of our knowledge, this article presents the first neural Turkish coreference resolution study, in which two learning-based models are explored. Both models follow the mention-ranking approach while forming clusters of mentions. The first model uses a set of hand-crafted features, whereas the second coreference model relies on embeddings learned from large-scale pre-trained language models to capture similarities between a mention and its candidate antecedents. Several language models trained specifically for Turkish are used to obtain mention representations, and their effectiveness is compared in experiments using automatic metrics. We argue that the results of this study shed light on the possible contributions of neural architectures to Turkish coreference resolution.
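    The mention-ranking approach described above can be sketched in a few lines: each mention is scored against its candidate antecedents and linked to the highest-scoring one, or it starts a new cluster. This is a minimal illustration, not the paper's model; the toy vectors stand in for embeddings from a pre-trained language model, and the threshold is invented.

    ```python
    from math import sqrt

    def cosine(u, v):
        # cosine similarity between two mention vectors
        dot = sum(a * b for a, b in zip(u, v))
        nu = sqrt(sum(a * a for a in u))
        nv = sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def rank_mentions(mentions, threshold=0.5):
        """mentions: list of (text, vector) in document order;
        returns a cluster id for each mention."""
        clusters = []
        for i, (_, vec) in enumerate(mentions):
            # score all earlier mentions as candidate antecedents
            scores = [(cosine(vec, mentions[j][1]), j) for j in range(i)]
            best_score, best_j = max(scores, default=(0.0, -1))
            if best_score >= threshold:
                clusters.append(clusters[best_j])        # join antecedent's cluster
            else:
                clusters.append(max(clusters, default=-1) + 1)  # new cluster
        return clusters
    ```

    A real system would replace the cosine score with a learned scoring function over hand-crafted features or contextual embeddings, as the two models in the article do.
    
    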

    Sentiment Analysis with Pre-trained Language Models

    Sentiment analysis is one of the methods used to examine, analyze, and interpret thoughts, feelings, or attitudes about a topic expressed on various platforms. Machine learning and deep learning models are frequently employed in sentiment analysis, where texts on different subjects can be classified according to their subjective content. In this study, sentiment analysis was performed on Covid-19 tweets using pre-trained language models. In addition to a Naive Bayes classifier, different classifiers were trained using the BERT, RoBERTa, and BERTweet language models, and the results obtained on the tweet dataset were compared. The work reported in this paper is expected to provide a foundation for future research in this area.
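    The Naive Bayes baseline mentioned above can be sketched from scratch: multinomial Naive Bayes with Laplace smoothing over token counts. This is a minimal illustration under invented toy data, not the study's actual training setup or dataset.

    ```python
    from collections import Counter, defaultdict
    from math import log

    def train_nb(docs):
        """docs: list of (tokens, label); returns (log-priors, per-label word log-probs)."""
        label_counts = Counter(label for _, label in docs)
        word_counts = defaultdict(Counter)
        for tokens, label in docs:
            word_counts[label].update(tokens)
        vocab = {w for counts in word_counts.values() for w in counts}
        priors = {l: log(n / len(docs)) for l, n in label_counts.items()}
        logprob = {}
        for label, counts in word_counts.items():
            total = sum(counts.values()) + len(vocab)   # Laplace smoothing
            logprob[label] = {w: log((counts[w] + 1) / total) for w in vocab}
        return priors, logprob

    def classify(tokens, priors, logprob):
        # pick the label maximizing log-prior + sum of word log-likelihoods
        scores = {
            label: priors[label] + sum(lp.get(w, 0.0) for w in tokens)
            for label, lp in logprob.items()
        }
        return max(scores, key=scores.get)
    ```

    The pre-trained models (BERT, RoBERTa, BERTweet) replace these count-based likelihoods with contextual representations fine-tuned for the classification task.
    
    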

    Graph-based Turkish text normalization and its impact on noisy text processing

    User generated texts on the web are freely available and valuable sources of data for language technology researchers. Unfortunately, these texts are often dominated by informal writing styles, and the language used in user generated content poses processing difficulties for natural language tools. The resulting performance drops and processing issues can be addressed either by adapting language tools to user generated content or by normalizing noisy texts before they are processed. In this article, we propose a Turkish text normalizer that maps non-standard words to their appropriate standard forms using a graph-based methodology and a context-tailoring approach. Our normalizer benefits from both contextual and lexical similarities between normalization pairs, as identified by a graph-based subnormalizer and a transformation-based subnormalizer. The performance of our normalizer is demonstrated on a tweet dataset in the most comprehensive intrinsic and extrinsic evaluations reported so far for Turkish. We present the first graph-based solution to Turkish text normalization with a novel context-tailoring approach, which advances the state of the art by outperforming other publicly available normalizers. For the first time in the literature, we measure the extent to which the accuracy of a Turkish language processing tool is affected by normalizing noisy texts before they are processed. An analysis of these extrinsic evaluations, which cover more than one Turkish NLP task (i.e., part-of-speech tagging and dependency parsing), reveals that Turkish language tools are not robust to noisy texts and that a normalizer leads to remarkable performance improvements once used as a preprocessing tool in this morphologically-rich language.
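    The lexical side of such a normalizer can be sketched as a nearest-neighbor lookup by edit distance: a noisy token is mapped to the closest in-vocabulary word if it is close enough. This is only an illustration of the transformation-based idea under an invented toy lexicon; the article's system additionally exploits the graph-based contextual signal.

    ```python
    def edit_distance(a, b):
        # standard Levenshtein distance via dynamic programming
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def normalize(token, lexicon, max_dist=2):
        """Map a noisy token to its closest standard form, or keep it as-is."""
        if token in lexicon:
            return token
        best = min(lexicon, key=lambda w: edit_distance(token, w))
        return best if edit_distance(token, best) <= max_dist else token
    ```

    The distance cutoff prevents wild rewrites: heavily abbreviated tokens that are far from every lexicon entry are left untouched rather than forced onto a poor match.
    
    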

    Turkish Data-to-Text Generation Using Sequence-to-Sequence Neural Networks

    This work is supported by TUBITAK-ARDEB under grant number 117E977. End-to-end data-driven approaches lead to rapid development of language generation and dialogue systems. Despite the need for large amounts of well-organized data, these approaches jointly learn multiple components of the traditional generation pipeline without requiring costly human intervention. End-to-end approaches also enable the use of loosely aligned parallel datasets in system development by relaxing the degree of semantic correspondence between training data representations and text spans. However, their potential in Turkish language generation has not yet been fully exploited. In this work, we apply sequence-to-sequence (Seq2Seq) neural models to Turkish data-to-text generation, where input data given in the form of a meaning representation is verbalized. We explore encoder-decoder architectures with an attention mechanism in unidirectional, bidirectional, and stacked recurrent neural network (RNN) models. Our models generate one-sentence biographies and dining venue descriptions using a crowdsourced dataset in which all field-value pairs that appear in meaning representations are fully captured in reference sentences. To support this work, we also explore the performance of our models on a more challenging dataset, where the content of a meaning representation is too large to fit into a single sentence, and hence content selection and surface realization need to be learned jointly. This dataset is retrieved by coupling introductory sentences of person-related Turkish Wikipedia articles with their contained infobox tables. Our empirical experiments on both datasets demonstrate that Seq2Seq models are capable of generating coherent and fluent biographies and venue descriptions from field-value pairs.
    We argue that the wealth of knowledge residing in our datasets and the insights obtained from this study hold the potential to give rise to the development of new end-to-end generation approaches for Turkish and other morphologically rich languages.
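    A typical preprocessing step for feeding field-value meaning representations to a Seq2Seq model is to linearize them into a token sequence and to delexicalize reference sentences so the model learns slot patterns rather than specific names. The field names, marker tokens, and placeholder scheme below are illustrative assumptions, not the paper's exact input format.

    ```python
    def linearize(mr):
        """Turn a meaning representation (field -> value) into one input string."""
        tokens = []
        for field, value in mr.items():
            tokens.append(f"<{field}>")    # field marker token
            tokens.extend(value.split())   # value tokens
            tokens.append(f"</{field}>")
        return " ".join(tokens)

    def delexicalize(sentence, mr):
        """Replace field values in a reference sentence with slot placeholders."""
        for field, value in mr.items():
            sentence = sentence.replace(value, f"__{field}__")
        return sentence
    ```

    At generation time the placeholders emitted by the model are filled back in from the meaning representation, which keeps rare names out of the output vocabulary.
    
    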

    A benchmark dataset for Turkish data-to-text generation

    In the last decades, data-to-text (D2T) systems that directly learn from data have gained a lot of attention in natural language generation. These systems need data with high quality and large volume, but unfortunately some natural languages suffer from the lack of readily available generation datasets. This article describes our efforts to create a new Turkish dataset (Tr-D2T) that consists of meaning representation and reference sentence pairs without fine-grained word alignments. We utilize Turkish web resources and existing datasets in other languages for producing meaning representations and collect reference sentences by crowdsourcing native speakers. We particularly focus on the generation of single-sentence biographies and dining venue descriptions. In order to motivate future Turkish D2T studies, we present detailed benchmarking results of different sequence-to-sequence neural models trained on this dataset. To the best of our knowledge, this work is the first of its kind that provides preliminary findings and lessons learned from the creation of a new Turkish D2T dataset. Moreover, our work is the first extensive study that presents generation performances of transformer and recurrent neural network models from meaning representations in this morphologically-rich language.

    An evaluation of recent neural sequence tagging models in Turkish named entity recognition

    Named entity recognition (NER) is an extensively studied task that extracts and classifies named entities in a text. NER is crucial not only in downstream language processing applications such as relation extraction and question answering but also in large scale big data operations such as real-time analysis of online digital media content. Recent research efforts on Turkish, a less studied language with a morphologically rich nature, have demonstrated the effectiveness of neural architectures on well-formed texts and yielded state-of-the-art results by formulating the task as a sequence tagging problem. In this work, we empirically investigate the use of recent neural architectures (bidirectional long short-term memory (BiLSTM) and Transformer-based networks) proposed for Turkish NER tagging in the same setting. Our results demonstrate that transformer-based networks, which can model long-range context, overcome the limitations of BiLSTM networks in which different input features at the character, subword, and word levels are utilized. We also propose a transformer-based network with a conditional random field (CRF) layer that leads to the state-of-the-art result (95.95% f-measure) on a common dataset. Our study contributes to the literature that quantifies the impact of transfer learning on processing morphologically rich languages.
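    The CRF layer on top of the transformer encoder decodes the best tag sequence with the Viterbi algorithm, combining per-token emission scores with a tag-transition matrix. A minimal sketch follows; the tag set, emission scores, and transition values are hypothetical, not the trained model's parameters.

    ```python
    def viterbi(emissions, transitions, tags):
        """emissions: per-token dicts tag -> score;
        transitions: dict (prev_tag, cur_tag) -> score;
        returns the highest-scoring tag sequence."""
        scores = [{t: emissions[0][t] for t in tags}]   # best score per tag so far
        back = []                                       # backpointers per position
        for em in emissions[1:]:
            prev = scores[-1]
            cur, ptr = {}, {}
            for t in tags:
                # pick the predecessor tag maximizing path score into t
                best_prev = max(tags, key=lambda p: prev[p] + transitions[(p, t)])
                cur[t] = prev[best_prev] + transitions[(best_prev, t)] + em[t]
                ptr[t] = best_prev
            scores.append(cur)
            back.append(ptr)
        # backtrack from the best final tag
        last = max(tags, key=lambda t: scores[-1][t])
        path = [last]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))
    ```

    Penalizing invalid transitions (e.g., `O` directly into `I-PER`) is what lets the CRF enforce well-formed BIO spans that a per-token softmax cannot guarantee.
    
    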

    An XML Parser for Turkish Wikipedia

    Nowadays, visual and written data that can be easily accessed over the internet have enabled the development of research in many different fields. However, the availability of data is not sufficient by itself. It is of great importance that these data can be effectively utilized and interpreted in accordance with the requirements. Written content in the Wikipedia encyclopedia, which is increasingly used in Turkish natural language processing, can be accessed via XML dumps. In this study, our aim is to develop an XML parser for processing Turkish Wikipedia dumps and to demonstrate its applicability. The use of the open-source parser, which allows information extraction at different levels of granularity, is reported on pages containing biography infoboxes and textual contents.
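    Streaming a Wikipedia XML dump with the standard library can be sketched as follows. This is an illustration, not the parser described above: the inline sample stands in for a real dump file, and a real MediaWiki export additionally wraps element names in an XML namespace that would need handling.

    ```python
    import xml.etree.ElementTree as ET
    from io import StringIO

    # Simplified stand-in for a MediaWiki dump (real dumps use a namespace).
    SAMPLE = """<mediawiki>
      <page>
        <title>Istanbul</title>
        <revision><text>'''Istanbul''' is a city ...</text></revision>
      </page>
    </mediawiki>"""

    def iter_pages(source):
        """Yield (title, wikitext) pairs while streaming the dump."""
        title, text = None, None
        for event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "title":
                title = elem.text
            elif elem.tag == "text":
                text = elem.text
            elif elem.tag == "page":
                yield title, text
                elem.clear()   # free finished subtrees so large dumps fit in memory
    ```

    Using `iterparse` instead of parsing the whole tree is what makes multi-gigabyte dumps tractable; further extraction (infobox fields, plain text) would operate on the yielded wikitext.
    
    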

    Autoantibodies Against Carbonic Anhydrase I and II in Patients with Acute Myeloid Leukemia

    Objective: Cancer, one of the principal causes of death, is a global social health problem. Autoantibodies developed against the organism’s self-antigens are detected in the sera of subjects with cancer. In recent years, carbonic anhydrase (CA) I and II autoantibodies have been shown in some autoimmune diseases and carcinomas, but the mechanisms underlying this immune response have not yet been explained. The aim of this study was to evaluate CA I and II autoantibodies in patients with acute myeloid leukemia (AML) and to provide a novel perspective regarding the autoimmune basis of the disease. Materials and Methods: Anti-CA I and II antibody levels were investigated using ELISA in serum samples from 30 patients with AML and 30 healthy peers. Results: Anti-CA I and II antibody titers in the AML group were significantly higher compared with the control group (p=0.0001 and 0.018, respectively). A strong positive correlation was also determined between titers of anti-CA I and II antibodies (r=0.613, p=0.0001). Conclusion: Our results suggest that these autoantibodies may be involved in the pathogenesis of AML. More extensive studies are now needed to reveal the entire mechanism.